The aim of this script is to perform a statistical analysis of the untargeted metabolomics datasets that have previously been pre-treated with the script “Prepare_Datasets_to_Metabolomic_Analysis.Rmd”. It performs several tasks to achieve this:
Reads the raw data files from the input folder.
Normalizes the dataset with the “NormalyzerDE” package without requiring a separate design input (the design is created on the fly from the information given in the input files).
After reviewing and evaluating the normalization report, the researcher loads the selected normalized data by updating line 138 with the corresponding file. You can choose between the following options: CycLoess-normalized.txt, GI-normalized.txt, log2-normalized.txt, mean-normalized.txt, median-normalized.txt, Quantile-normalized.txt, RLR-normalized.txt, VSN-normalized.txt. If you do not want to use any normalized file, set it to “None”.
Visualizes the data before and after normalization with box plots from the “NormalizeMets” package.
Principal Component Analysis (PCA) and Hierarchical Cluster Analysis (HCA) of the normalized dataset, using the “FactoMineR” and “factoextra” packages, to verify correct group differentiation and to detect a possible batch effect.
Statistical analysis with the “limma” package to extract the p-values and log2 fold change (LFC) values for each metabolite in the conducted comparisons. Additionally, p-value adjustment methods such as Benjamini-Hochberg, Bonferroni, and q-values are applied. Example tables of the resulting datasets, as well as plots displaying the original p-values and q-values, are available for visualization.
The identification of significant metabolites is based on the statistical thresholds set in point 4 of the usage instructions. A message is displayed indicating the number of decreased and increased metabolites for each adjustment method, taking into account the linear model used in the limma analysis. This provides insight into the directionality of differential expression for the identified metabolites. [Since the contrast has been set up as `contrast <- paste0(group[2], "-", group[1])`, it represents the difference in expression levels between Group 2 and Group 1. A positive LFC indicates higher expression in Group 2 compared to Group 1, while a negative LFC indicates higher expression in Group 1 compared to Group 2. The magnitude of the LFC represents the size of the change between the two groups.] These counts are represented with Venn diagrams (“ggVennDiagram” package) and an upset plot (“UpSetR” package).
After a p-value adjustment method has been selected in line 660, the significant metabolites that pass the statistical thresholds are plotted in Mean-Average (MA) plots and Volcano plots, including a dynamic volcano plot in which you can inspect the information of each dot (metabolite). Tables with the 10 most significant metabolites for each comparison are displayed.
In the final step, a sparse Partial Least Squares Discriminant Analysis (sPLS-DA) is conducted with the “mixOmics” package, using 2 components and 25 variables. The results of the sPLS-DA analysis include a loadings plot, an individual samples plot, a variable plot, and a prediction assay. Additionally, the prediction results and the top 10 most important variable predictors for each comparison are presented in tables, providing insight into the predictive power of the model and the variables driving the differences between the groups.
To run this script, you need to provide the input and output folder addresses, as well as the statistical thresholds.
rm(list = ls())
Input data must come from the script “Prepare_Datasets_to_Statistical_Analysis.Rmd”, because the files must contain the words “one_factor” in their name.
You will need the following packages:
packages <- c("tidyverse", "plotly", "mixOmics", "devtools", "NormalizeMets", "tools", "kableExtra",
"qvalue", "NormalyzerDE", "limma", "FactoMineR", "factoextra", "ggVennDiagram", "UpSetR", "ggpubr",
"ggrepel")
for (package in packages) {
if (!requireNamespace(package, quietly = TRUE)) {
if (package == "NormalizeMets") {
devtools::install_github("metabolomicstats/NormalizeMets")
} else {
install.packages(package)
}
}
library(package, character.only = TRUE)
}
input_folder <- "../../1_Data/Comparative_G/Untargeted_metabolomic_data/Ready_to_use"
# Set Normalize method on line 114 #
# Set the statistical parameters for the significant analysis
stats_values <- list()
stats_values[["LFC_limit"]] <- 2
stats_values[["alpha_value"]] <- 0.05
files <- list.files(input_folder)
files <- files[grepl("_G", files, ignore.case = T)]
This code section iterates over a list of files, reads each file’s data, prepares the dataset by assigning row names and extracting group information, writes the design and raw data to separate files, applies the normalyzer function for normalization, saves the normalized result, and finally removes temporary files.
for (i in files){
# Read data files
data_path <- file.path(input_folder, i)
raw_data <- read.csv(data_path)
# Prepare dataset
rownames(raw_data) <- raw_data$name
group <- as.character(raw_data[1, -1])
raw_data <- raw_data[-1, -1]
# Design and temporal objects
design <- data.frame(sample=colnames(raw_data), group=group)
write.table(x = design, file = paste0(input_folder, "/", sub("_.*.?", "", i), "_design.tsv"),
quote = F, row.names = F, sep = "\t")
write.table(x = raw_data, file = paste0(input_folder, "/", sub("_.*.?", "", i), "_tmp_data.tsv"),
quote = F, row.names = F, sep = "\t")
# NormalyzerDE application
suppressMessages(normalyzer(jobName = paste0(sub("_.*.?", "", i), "_NormalyzerDE"),
designPath = paste0(input_folder, "/", sub("_.*.?", "", i), "_design.tsv"),
dataPath = paste0(input_folder, "/", sub("_.*.?", "", i), "_tmp_data.tsv"),
outputDir = input_folder))
cat(sprintf("File %s has been normalized and the result has been saved in %s.\n", i,
paste0(input_folder, "/", sub("_.*.?", "", i), "_NormalyzerDE")))
unlink(paste0(input_folder, "/", sub("_.*.?", "", i), "_design.tsv"), recursive = T)
unlink(paste0(input_folder, "/", sub("_.*.?", "", i), "_tmp_data.tsv"), recursive = T)
}
For the example data that was used to optimize this script and to develop the corresponding Final Master Project, you can refer to the normalization summaries in the following files.
Normalized summary of Global OTA analysis
# You can choose between the followings data normalized files: CycLoess-normalized.txt, GI-normalized.txt, log2-normalized.txt, mean-normalized.txt, median-normalized.txt, Quantile-normalized.txt, RLR-normalized.txt, VSN-normalized.txt.
# If you do not want to use any normalized file, set it to "None"
norm_selected <- "/CycLoess-normalized.txt"
if (norm_selected != "None") {
for (i in files){
name <- paste0(sub("_.*.?", "", i), "_normalized")
data <- read.table(paste0(input_folder, "/", paste0(sub("_.*.?", "", i), "_NormalyzerDE"),
norm_selected), header = T)
# Rename rows according to their original names
data2 <- read.csv(paste0(input_folder, "/", i))
rownames(data2) <- data2$name
group <- data2[1, -1] %>% rownames_to_column(); data2 <- data2[-1, -1]
row.names(data) <- row.names(data2)
# Restructure the dataset by rearranging the columns
data <- rownames_to_column(data, var = "name")
data <- rbind(as.character(group), data)
data[is.na(data)] <- 0
assign(name, data)
write.csv(data, file = paste0(input_folder, "/", paste0(sub("_.*.?", "", i), "_normalized.csv")), row.names = FALSE)
cat(sprintf("File %s normalized by %s method has been saved in %s.\n", i, sub(".*/(.*?)\\-.*", "\\1", norm_selected),
paste0(input_folder, "/", paste0(sub("_.*.?", "", i), "_normalized.csv \n"))))
}
files2 <- ls()
files2 <- files2[grepl("normalized", files2, ignore.case = T)]
for (i in files){
# Set the new name for the non-normalized data that will be analysed
name <- paste0(sub("_.*.?", "", i), "_no_normalized")
data_path <- file.path(input_folder, i)
data <- read.csv(data_path)
assign(name, data)
}
files <- ls()
files <- files[grepl("no_normalized", files, ignore.case = T)]
} else {
for (i in files){
# Set the new name for the non-normalized data that will be analysed
name <- paste0(sub("_.*.?", "", i), "_no_normalized")
data_path <- file.path(input_folder, i)
data <- read.csv(data_path)
assign(name, data)
}
files <- ls()
files <- files[grepl("no_normalized", files, ignore.case = T)]
}
## File Comparative_G.csv normalized by CycLoess method has been saved in ../../1_Data/Comparative_G/Untargeted_metabolomic_data/Ready_to_use/Comparative_normalized.csv
## .
plots1 <- list()
plots2 <- list()
if (norm_selected != "None") {
for (i in files){
data <- get(i)
# Prepare dataset without normalization
group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
met_names <- data$name %>% .[-1]
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1] %>% LogTransform(. ,zerotona=TRUE)
data <- data$featuredata %>% t() %>% `colnames<-`(met_names)
# Original data visualization (log transformed and numeric format)
plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F,
plotname = paste0("BoxPlot of ", sub("_.*.?", "", i), " no normalized"),
savetype = "png", interactiveplot = T)
index <- length(plots1) + 1
plots1[[index]] <- plot
}
for (i in files2){
data <- get(i)
# Prepare dataset normalized
group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
met_names <- data$name %>% .[-1]
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1]
data <- data %>% t() %>% `colnames<-`(met_names)
# Normalized data visualization (log transformed and numeric format)
plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F,
plotname = paste0("BoxPlot of ", sub("_.*.?", "", i),
" normalized by ", sub(".*/(.*?)\\-.*", "\\1", norm_selected), " method"),
savetype = "png", interactiveplot = T)
index <- length(plots2) + 1
plots2[[index]] <- plot
}
} else {
for (i in files){
data <- get(i)
# Prepare dataset without normalization
group <- data[1, ] %>% `row.names<-`(.[, 1]) %>% .[, -1] %>% t()
met_names <- data$name %>% .[-1]
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, -1] %>% LogTransform(. ,zerotona=TRUE)
data <- data$featuredata %>% t() %>% `colnames<-`(met_names)
# Original data visualization (log transformed and numeric format)
plot <- RlaPlots(data, group, cex.axis = 0.6, saveplot = F,
plotname = paste0("BoxPlot of ", sub("_.*.?", "", i), " no normalized"),
savetype = "png", interactiveplot = T)
index <- length(plots2) + 1
plots2[[index]] <- plot
}
}
if (norm_selected == "None") {
for (i in files) {
# Dataset set-up
data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
group <- as.character(data[1 ,])
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
# PCA performance
tmp_pca <- PCA(data, graph = FALSE, scale.unit = TRUE, quali.sup = 1)
## Boxplot
tmp_eig <- tmp_pca$eig %>% .[1:5, ]
barplot(tmp_eig[, 2], names.arg=1:nrow(tmp_eig),
main = paste("PCA of", i, "no normalized"),
xlab = "Principal Components",
ylab = "Percentage of variances",
col ="gold")
## Graph of individuals
plot <- fviz_pca_ind(tmp_pca, col.ind = group,
pointsize=1, pointshape=21,fill="black",
repel = T,
addEllipses = T, ellipse.type = "confidence",
legend.title ="Conditions",
show_legend = TRUE, show_guide = TRUE)
labels <- rownames(data)
plot <- plot +
ggtitle(paste("PC1 & PC2 of", i, "no normalized")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
ggsave(filename = paste0("../../Images/Global_OTA_comparative/PCA_plot_of_", i, ".png"),
plot = plot, width = 8, height = 6, dpi = 300)
print(plot)
}
} else {
for (i in files2) {
# Dataset set-up
data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
group <- as.character(data[1 ,]); names <- colnames(data)
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
# PCA performance
tmp_pca <- PCA(data, graph = FALSE, scale.unit = TRUE, quali.sup = 1)
## Boxplot
tmp_eig <- tmp_pca$eig %>% .[1:5, ]
barplot(tmp_eig[, 2], names.arg=1:nrow(tmp_eig),
main = paste("PCA of", i, "normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method"),
xlab = "Principal Components",
ylab = "Percentage of variances",
col ="gold")
## Graph of individuals
plot <- fviz_pca_ind(tmp_pca, col.ind = group,
pointsize=1, pointshape=21,fill="black",
repel = T,
addEllipses = T, ellipse.type = "confidence",
legend.title ="Conditions",
show_legend = TRUE, show_guide = TRUE)
labels <- rownames(data)
plot <- plot +
ggtitle(paste("PC1 & PC2 of", i, "normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
ggsave(filename = paste0("../../Images/Global_OTA_comparative/PCA_plot_of_", i, ".png"),
plot = plot, width = 8, height = 6, dpi = 300)
print(plot)
}
}
plots4 <- list()
if (norm_selected == "None") {
for (i in files) {
# Dataset set-up
data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
group <- as.character(data[1 ,])
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
# HCA performance
res.pca <- PCA(data, graph = FALSE,scale.unit = TRUE,quali.sup = 1)
res.hcpc <- HCPC(res.pca, graph=FALSE,nb.clust = 2)
plot <- fviz_dend(res.hcpc, k = NULL, cex = 0.7, palette = "lancet",
label_cols = "black", rect = T, rect_fill = T, rect_border = "lancet",
type="rectangle", labels_track_height = 1000, horiz = F, repel = T) +
ggtitle(paste("Cluster Dendrogram of", i, "no normalized")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
ggsave(filename = paste0("../../Images/Global_OTA_comparative/HCA_plot_of_", i, ".png"),
plot = plot, width = 8, height = 6, dpi = 300)
print(plot)
index <- length(plots4) + 1
plots4[[index]] <- plot
}
} else {
for (i in files2) {
# Dataset set-up
data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
group <- as.character(data[1 ,]); names <- colnames(data)
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% t()
# HCA performance
res.pca <- PCA(data, graph = FALSE,scale.unit = TRUE,quali.sup = 1)
res.hcpc <- HCPC(res.pca, graph=FALSE,nb.clust = 2)
plot <- fviz_dend(res.hcpc, k = NULL, cex = 0.7, palette = "lancet",
label_cols = "black", rect = T, rect_fill = T, rect_border = "lancet",
type="rectangle", labels_track_height = 1000, horiz = F, repel = T) +
ggtitle(paste("Cluster Dendrogram of", i, "normalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
ggsave(filename = paste0("../../Images/Global_OTA_comparative/HCA_plot_of_",i, ".png"),
plot = plot, width = 8, height = 6, dpi = 300)
print(plot)
}
}
The p-value is a measure used in statistical hypothesis testing to determine the significance of a comparison. When the p-value is smaller than a significance level “α”, there is sufficient evidence to reject the null hypothesis (H0). The most commonly used values for “α” are 0.05 (5%) and 0.01 (1%).
In this research, where “α” = 0.05, the null hypothesis suggests that there are no significant differences between the amounts of the metabolite in each condition. By rejecting the H0 and accepting the alternative hypothesis (H1), we conclude that there are statistically significant differences in the amounts of the metabolite between the conditions being compared.
* H0 —> there is no difference between the sample groups in the amounts of the metabolites.
* H1 —> there is a difference between the sample groups in the amounts of the metabolites.
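As a minimal illustration of this decision rule (with hypothetical intensity values, not data from this study), a two-sample t-test makes the H0/H1 choice explicit. Note that limma uses moderated t-statistics rather than the plain Welch t-test shown here:

```r
# Hypothetical intensities for one metabolite in two conditions
set.seed(1)
control <- rnorm(6, mean = 10, sd = 1)  # condition 1
treated <- rnorm(6, mean = 14, sd = 1)  # condition 2
res <- t.test(treated, control)          # Welch two-sample t-test
res$p.value < 0.05                       # TRUE -> reject H0 at alpha = 0.05
```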
The log2 fold change (LFC) indicates the magnitude of the difference in a metabolite’s amount between two conditions. By fixing an LFC threshold of 1 (i.e., ±1), we consider as statistically significant only those metabolites whose amount differs by at least two-fold in either direction.
If we instead fix an LFC threshold of 2 (±2), we are more restrictive and consider only those metabolites whose amount differs by at least four-fold in either direction. This approach captures larger and more pronounced changes in metabolite levels.
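The mapping from an LFC threshold to the fold change it implies can be sketched in one line; the threshold of 2 set in `stats_values$LFC_limit` therefore corresponds to a four-fold change:

```r
# Fold change implied by a log2 fold change (LFC) threshold
lfc_to_fold <- function(lfc) 2^abs(lfc)
lfc_to_fold(1)  # 2 -> two-fold
lfc_to_fold(2)  # 4 -> four-fold, the threshold used in this script
```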
files2 <- vector()
for (i in files){
# Dataset set-up
data <- get(i) %>% `row.names<-`(., .[["name"]]) %>% .[-1]
met_names <- row.names(data) %>% .[-1]
group_rep <- as.character(data[1 ,]) %>% table(.)
group <- as.character(data[1 ,]) %>% unique(.)
data <- as.data.frame(lapply(data, as.numeric)) %>% .[-1, ] %>% `row.names<-` (., met_names)
# Prepare limma assay (note: assumes the samples are ordered by group in the input file)
experimental.design <- model.matrix(~ -1+factor(c(rep(1, group_rep[[1]]),rep(2, group_rep[[2]])))) %>%
`colnames<-`(., group)
# Limma assay (condition 2 vs. condition 1)
linear.fit <- lmFit(data, experimental.design)
contrast <- paste0(group[2],"-",group[1])
contrast.matrix <- makeContrasts(contrasts = contrast, levels = group)
contrast.linear.fit <- contrasts.fit(linear.fit, contrast.matrix)
contrast.results <- eBayes(contrast.linear.fit)
# New dataframe with stats
data_stats <- topTable(contrast.results, number = nrow(data), coef = 1)
new_name <- paste0(sub("_.*.?", "", i), "_stats")
assign(new_name, data_stats)
files2 <- append(files2, new_name)
output <- paste(rownames(data_stats[1, ]), "has the following statistics",
paste(colnames(data_stats), data_stats[1, ], sep = ": ", collapse = ", "))
cat(sprintf("The file %s has been processed using the Limma package, comparing condition %s against condition %s, and the statistical information obtained has been saved in the file %s. Here is an example of the information it contains: metabolite %s\n", i, group[2], group[1], new_name, output))
cat("\n")
}
## The file Comparative_normalized has been processed using the Limma package, comparing condition OTA against condition noOTA, and the statistical information obtained has been saved in the file Comparative_stats. Here is an example of the information it contains: metabolite Choline has the following statistics logFC: 15.0220617799093, AveExpr: 20.6805497584677, t: 129.161930185515, P.Value: 5.61343605955094e-57, adj.P.Val: 1.80191297511585e-54, B: 110.433652987861
cat(sprintf("Vector files2 has the following information %s.\n", paste(files2, collapse = ", ")))
## Vector files2 has the following information Comparative_stats.
Multiple comparisons refer to the situation where several pairwise comparisons are made within a dataset or experiment. When conducting multiple comparisons, the probability of obtaining false positive results (Type I errors) increases. To address this issue, various statistical methods can be used to adjust the significance level (p-value) or control the overall error rate.
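The inflation of false positives can be sketched for independent tests: the probability of at least one Type I error grows quickly with the number of comparisons (321 metabolites in this dataset).

```r
# Family-wise false positive probability for m independent tests at level alpha
fwer <- function(m, alpha = 0.05) 1 - (1 - alpha)^m
fwer(1)    # 0.05
fwer(10)   # ~0.40
fwer(321)  # ~1: at least one false positive is practically certain
```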
The family-wise error rate (FWER) is a statistical concept that refers to the probability of making at least one Type I error (rejecting a true null hypothesis) among a set of multiple hypothesis tests. The FWER quantifies the overall error rate when multiple hypotheses are tested simultaneously.
The Bonferroni correction involves dividing the significance level α by the number of comparisons performed. For example, with 10 comparisons and a desired overall significance level of 5%, the p-value threshold becomes 0.005 (0.05 divided by 10). A comparison whose p-value falls below this adjusted threshold is considered statistically significant. This method adjusts the significance level of each individual hypothesis to maintain the desired FWER.
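Equivalently, instead of lowering the threshold, `p.adjust()` multiplies each p-value by the number of comparisons (capping at 1) so the result can be compared against the original α. A minimal sketch with hypothetical p-values:

```r
# Bonferroni: p * m (capped at 1), compared against the unadjusted alpha
p <- c(0.001, 0.004, 0.03, 0.2)
p.adjust(p, method = "bonferroni")          # 0.004 0.016 0.12 0.80
p.adjust(p, method = "bonferroni") < 0.05   # TRUE TRUE FALSE FALSE
```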
The “BH” (aka “fdr”) and “BY” methods of Benjamini, Hochberg, and Yekutieli control the false discovery rate (FDR), the expected proportion of false discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others.
The Benjamini-Hochberg (BH) method is another technique used to control false positives in multiple comparisons. Unlike Bonferroni, BH is less conservative and can be more powerful in detecting statistically significant differences. Rather than controlling the FWER, BH controls the expected proportion of false discoveries among the rejected hypotheses. This is achieved by ordering the p-values from lowest to highest, computing a critical value for each rank based on the desired FDR, and comparing each p-value to its critical value in order. P-values below their critical value are considered statistically significant.
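The step-up procedure described above can be reproduced by hand and checked against `p.adjust(..., method = "BH")`, here with hypothetical p-values:

```r
# Manual Benjamini-Hochberg adjustment vs. the built-in implementation
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216)
m <- length(p)
o <- order(p)
bh <- p[o] * m / seq_len(m)        # p_(k) * m / k for the sorted p-values
bh <- rev(cummin(rev(bh)))         # enforce monotonicity from the largest rank down
manual <- pmin(bh, 1)[order(o)]    # cap at 1 and restore the original order
all.equal(manual, p.adjust(p, method = "BH"))  # TRUE
```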
Just as the p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value, the q-value gives the expected pFDR obtained by rejecting the null hypothesis for any result with an equal or smaller q-value.
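A practical connection worth noting: when the proportion of true null hypotheses pi0 is fixed at 1 (as this script does for one file with `qvalue(tmp, pi0 = 1)`), Storey's q-values reduce to the BH-adjusted p-values, so base R's `p.adjust()` gives a conservative approximation without the qvalue package:

```r
# With pi0 = 1, q-values coincide with BH-adjusted p-values (hypothetical p-values)
p <- c(0.001, 0.008, 0.039, 0.2)
q_approx <- p.adjust(p, method = "BH")  # equals qvalue(p, pi0 = 1)$qvalues
q_approx
```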
P-VALUE ADJUSTMENT
for (i in files2) {
message(paste("Processing file:", i))
data <- get(i)
data$p_value_bonferroni <- p.adjust(data$P.Value, method = "bonferroni", n = length(data$P.Value))
names(data)[names(data) == "adj.P.Val"] <- "p_value_BH"
new_name <- paste0("q_value_", sub("_.*.?", "", i))
tmp <- data$P.Value
if (i == "compC_stats"){
message(paste("Error in qvalue calculation for file:", i))
tmp_q_value <- qvalue(tmp, pi0 = 1)
message(paste("File", i, "processed successfully with pi0 = 1"))
} else {
tmp_q_value <- qvalue(tmp)
message(paste("File", i, "processed successfully"))
}
data$q_value <- tmp_q_value$qvalues
assign(new_name, tmp_q_value)
assign(i, data)
}
## Processing file: Comparative_stats
## File Comparative_stats processed successfully
| | logFC | AveExpr | t | P.Value | p_value_BH | B | p_value_bonferroni | q_value |
|---|---|---|---|---|---|---|---|---|
| Choline | 15.02206 | 20.68055 | 129.16193 | 5.6e-57 | 1.8e-54 | 110.43365 | 1.8e-54 | 1.8e-54 |
| C7H16ClN2PS2 | -11.66071 | 16.32879 | -97.15851 | 1.0e-51 | 1.6e-49 | 102.54360 | 3.3e-49 | 1.6e-49 |
| Nicotinicacid | -12.39273 | 20.48469 | -86.97612 | 1.1e-49 | 1.2e-47 | 99.06696 | 3.6e-47 | 1.2e-47 |
| (9Z)-Tetradecenoicacid | -13.29419 | 16.71024 | -84.81134 | 3.3e-49 | 2.6e-47 | 98.24711 | 1.1e-46 | 2.6e-47 |
| Cholinesulfate | 11.39834 | 19.67390 | 82.78929 | 9.1e-49 | 5.9e-47 | 97.45287 | 2.9e-46 | 5.9e-47 |
| MFCD00042968 | -6.63682 | 16.29040 | -76.07777 | 3.3e-47 | 1.6e-45 | 94.60198 | 1.1e-44 | 1.6e-45 |
P-VALUE REPRESENTATION
for (i in files2){
# Q-value plots:
q_value_plot <- paste0("q_value_", sub("_.*.?", "", i))
q_value_plot <- get(q_value_plot)
plot(q_value_plot, main = paste("Q-value plots of file", i))
# P-values histograms:
data <- get(i)
hist(data$P.Value, main = paste("Original p-value of file", i),
xlab = "p-value", col = "gold")
}
## SIGNIFICANT METABOLITES in file Comparative_stats with an LFC value of 2 and an α value of 0.05.
##
## The total number of metabolites is 321.
##
## The significant metabolites in file Comparative_stats by original p-values obtained by limma are 71.
## - 23 are increased.
## - 48 are decreased.
## - The most significant metabolite increased in OTA condition is Choline.
## - Whereas the most significant metabolite decreased in OTA condition is C7H16ClN2PS2.
##
## The significant metabolites in file Comparative_stats by Bonferroni p-values are 71.
## - 23 are increased.
## - 48 are decreased.
## - The most significant metabolite increased in OTA condition is Choline.
## - Whereas the most significant metabolite decreased in OTA condition is C7H16ClN2PS2.
##
## The significant metabolites in file Comparative_stats by Benjamini-Hochberg p-values are 71.
## - 23 are increased.
## - 48 are decreased.
## - The most significant metabolite increased in OTA condition is Choline.
## - Whereas the most significant metabolite decreased in OTA condition is C7H16ClN2PS2.
##
## The significant metabolites in file Comparative_stats by q-value test are 71.
## - 23 are increased.
## - 48 are decreased.
## - The most significant metabolite increased in OTA condition is Choline.
## - Whereas the most significant metabolite decreased in OTA condition is C7H16ClN2PS2.
##
## |------------------------|
for (i in files2){
data <- get(i)
# Obtain the significant metabolites by original p-values
significant <- subset(data, (logFC > stats_values$LFC_limit & P.Value < stats_values$alpha_value) |
(logFC < -stats_values$LFC_limit & P.Value < stats_values$alpha_value))
Original_mets <- row.names(significant)
# Obtain the significant metabolites by BH p-values
significant <- subset(data, (logFC > stats_values$LFC_limit & p_value_BH < stats_values$alpha_value) |
(logFC < -stats_values$LFC_limit & p_value_BH < stats_values$alpha_value))
BH_mets <- row.names(significant)
# Obtain the significant metabolites by Bonferroni p-values
significant <- subset(data, (logFC > stats_values$LFC_limit & p_value_bonferroni < stats_values$alpha_value) |
(logFC < -stats_values$LFC_limit & p_value_bonferroni < stats_values$alpha_value))
Bonferroni_mets <- row.names(significant)
# Obtain the significant metabolites by q-values
significant <- subset(data, (logFC > stats_values$LFC_limit & q_value < stats_values$alpha_value) |
(logFC < -stats_values$LFC_limit & q_value < stats_values$alpha_value))
Qvalue_mets <- row.names(significant)
significant <- list(Original_mets = Original_mets, BH_mets = BH_mets, Bonferroni_mets = Bonferroni_mets,
Qvalue_mets = Qvalue_mets)
# Performance Venn diagram
random_color <- sample(colors(), 1)
plot <- ggVennDiagram(significant, label_alpha = 0.4) +
scale_fill_gradient(low="white", high = random_color) + labs(fill = "Significant \nmetabolites") +
ggtitle(paste("Venn Diagram of file", i,"with the significant \nmetabolites according to the p-value used")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
print(plot)
# Upset diagram
plot <- upset(fromList(significant),
number.angles = 0, point.size = 3, line.size = 1,
sets.x.label = "Number of significant \n metabolites",
mainbar.y.label = paste("Upset Diagram of file", i,
"\nwith the significant metabolites \naccording to the p-value used
\n\nIntersection Size"),
set_size.show = T, matrix.color = random_color,
set_size.scale_max = max(sapply(significant, length))+50,
text.scale = c(1.5, 1.5, 1, 1, 1.4, 2), order.by = "freq")
print(plot)
}
# Set the p_value adjustment method selected to use in the following steps. You can choose between: p_value_BH, p_value_bonferroni, q_value and P.Value
selected_p_adjustment <- "p_value_BH"
## Create a function that bind columns and complete the missing rows with NA values
cbind.fill <- function(...){
nm <- list(...)
nm<-lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x) {
rbind(x, matrix(NA, n-nrow(x), ncol(x)))
}))
}
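A quick usage sketch of `cbind.fill` (a compact version of the definition is repeated so the chunk is self-contained; the metabolite names are illustrative):

```r
# cbind.fill: column-bind inputs of unequal length, padding the short ones with NA
cbind.fill <- function(...){
nm <- lapply(list(...), as.matrix)
n  <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function(x) rbind(x, matrix(NA, n - nrow(x), ncol(x)))))
}
increased <- as.data.frame(c("Choline", "Cholinesulfate", "Stachydrine"))
decreased <- as.data.frame(c("Nicotinicacid"))
cbind.fill(increased, decreased)  # 3 x 2 matrix; the shorter column is padded with NA
```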
for (i in files2) {
# Prepare datasets
data <- get(i)
names(data) <- gsub("logFC", "log2FoldChange", gsub("AveExpr", "baseMean", gsub("p_value_BH", "padj", names(data))))
data <- data[c("log2FoldChange", "baseMean", "padj")]
# Save significant metabolites as .csv file, filling with NA values till the end of the dataframe.
## Select the desired metabolites to save
increased <- subset(data, log2FoldChange > stats_values$LFC_limit & padj < stats_values$alpha_value) %>%
rownames_to_column() %>% .[, 1] %>% as.data.frame()
decreased <- subset(data, log2FoldChange < - stats_values$LFC_limit & padj < stats_values$alpha_value) %>%
rownames_to_column() %>% .[, 1] %>% as.data.frame()
## Save data
significant_data <- cbind.fill(increased, decreased) %>% `colnames<-`(lm_contrast[sub("_.*.?", "", i)][[1]])
write.csv(significant_data, file = paste0(input_folder, "/Significant_mets.csv"), row.names = FALSE)
# Performance MA plot
plot <- ggmaplot(
data, fdr = stats_values$alpha_value, fc = stats_values$LFC_limit,
genenames = row.names(data), detection_call = NULL, size = 1.5,
alpha = 1, seed = 42, font.label = c(10, "italic", "black"),
label.rectangle = F, palette = c("#B31B21", "#1465AC", "darkgray"),
top = 4, select.top.method = c("padj", "fc"), label.select = NULL,
main = NULL, xlab = "Log2 mean expression", ylab = "Log2 fold change",
ggtheme = theme_pubr()) +
ggtitle(paste0("MA plot of file ", i, " (", lm_contrast[sub("_.*.?", "", i)][[1]][1],
" Vs. ",lm_contrast[sub("_.*.?", "", i)][[1]][2], ")")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold"))
ggsave(filename = paste0("../../Global_OTA_comparative/Images/MA_plot_of_", sub("_.*.?", "", i), ".png"),
plot = plot, width = 8, height = 6, dpi = 300)
print(plot)
}
j <- 0
plots <- list()
tables1 <- list()
RDS_files <- list()
for (i in files2){
data <- get(i) %>% rownames_to_column("Metabolites") %>% .[c("Metabolites", "logFC", "p_value_BH")]
data <- data %>%
mutate(Significant = ifelse(abs(logFC) < stats_values$LFC_limit & p_value_BH > stats_values$alpha_value, "ns",
ifelse(logFC >= stats_values$LFC_limit & p_value_BH < stats_values$alpha_value, "up",
ifelse(logFC <= -stats_values$LFC_limit & p_value_BH < stats_values$alpha_value, "down",
"ns"))))
# Make a table with the 10 most significant metabolites to each condition
mets <- bind_cols(data %>% arrange(desc(Significant)) %>% head(10) %>% .$Metabolites,
data %>% arrange(Significant) %>% head(10) %>% .$Metabolites) %>%
`colnames<-`(c(lm_contrast[sub("_.*.?", "", i)][[1]][1], lm_contrast[sub("_.*.?", "", i)][[1]][2]))
## Save data set in a list with the correspondence name
RDS_files[[sub("_.*.?", "", i)]] <- mets
j <- j+1
table <- mets %>%
kbl(caption = paste("Table", j,": The 10 most significant metabolites to each condition of",
sub("_.*.?", "", i), "are: ")) %>%
kable_classic(full_width = F, html_font = "Cambria")
index <- length(tables1) + 1
tables1[[index]] <- table
# Make a dataframe with the most significant metabolites, in order to fix them in the volcano plot
mets <- list()
mets <- bind_rows(data %>% arrange(desc(Significant)) %>% head(5),
data %>% arrange(Significant) %>% head(5))
plot <- data %>%
ggplot(aes(x=logFC, y=-log10(p_value_BH), color=Significant,
text=paste0("</br>Metabolite: ", Metabolites,
"</br>Adj p-value: ", p_value_BH,
"</br>LFC: ", logFC))) +
geom_point(alpha=0.75, shape=16) + xlim(-15,15) +
theme_gray() + theme(legend.position = "bottom") +
ggtitle(paste0("Volcano plot of file ", i, " (", lm_contrast[sub("_.*.?", "", i)][[1]][1],
" Vs. ",lm_contrast[sub("_.*.?", "", i)][[1]][2], ")")) +
theme(plot.title = element_text(hjust = 0.5, color="black", size=14, face="bold")) +
labs(color = "Significant \nMetabolite") +
scale_color_manual(values = c("up" = "#faaa11", "down" = "#26b3ff", "ns" = "grey")) +
geom_point(data = mets, aes(color = Significant), size = 4) +
geom_text_repel(data = mets, aes(label = Metabolites), size = 3, color = "black")
print(plot)
ggsave(filename = paste0("../../Global_OTA_comparative/Images/Volcano_plot_of_", sub("_.*.?", "", i), ".png"),
plot = plot, width = 8, height = 6, dpi = 300)
index <- length(plots) + 1
plots[[index]] <- plot
}
## New names:
## • `` -> `...1`
## • `` -> `...2`
#saveRDS(RDS_files, file = "../../../Tables_TFM/global_significant_mets.rds")
tables1[[1]]
| OTA | noOTA |
|---|---|
| Choline | C7H16ClN2PS2 |
| Cholinesulfate | Nicotinicacid |
| MFCD00053868 | (9Z)-Tetradecenoicacid |
| 1,5-Pentanedithiol | MFCD00042968 |
| Stachydrine | p-dicyclohexylbenzene |
| 2-Oxovalericacid | Pestarhamnose B |
| 4-O-methylmelleolide | C6H16N3P3 |
| (R)-2_3-Dihydroxy-3-methylbutanoate | N-(2-Amino-2-oxoethyl)-N’-cyclododecylsuccinamide |
| 3-Butyn-1-al | 2-Methylserine |
| Furfuranol | C8H14N10S2 |
PLS-DA is a statistical method used for classification. It helps to predict or classify samples into different groups based on a set of variables. It is particularly useful for high-dimensional data like metabolomics. By applying PLS-DA, you can identify the important variables that distinguish between different groups, providing valuable insights for further analysis.
In PLS-DA, the method combines the principles of Partial Least Squares (PLS) regression and Discriminant Analysis. It aims to find a linear combination of the predictor variables that maximally explains the variation in the response variable while also maximizing the separation between different groups or categories.
In this case, given the large number of variables available, we decided to perform a sparse PLS-DA analysis. We limited the analysis to 2 components, based on the good separation achieved in the PCA assay, and reduced the number of variables kept per component to 25 (the `keepX` argument below).
tables1 <- list()
tables2 <- list()
RDS_files <- list()
RDS_files1 <- list()
j <- 0  # counter for the table captions below
for (i in files) {
j <- j + 1
data <- get(i)
Y <- as.character(data[1 ,]) %>% .[-1] %>% as.factor(.)
mets <- (data[, 1]) %>% .[-1] %>% as.data.frame() %>% `colnames<-`(., "Mets_names") %>%
mutate(ID = paste0("Metabolite_", 1:(nrow(data)-1)))
write.csv(mets, file = paste0(input_folder, "/CompG_mets_names.csv"), row.names = FALSE)
data <- data[-1, -1] %>% mutate_all(as.numeric) %>% t() %>% `colnames<-`(., mets$ID)
# Prepare data for the sPLS-DA
## Remove variables with zero standard deviation
sd_values <- apply(data, 2, sd)
filtered_data <- data[, sd_values > 0] %>% as.data.frame(); X <- filtered_data
# Perform sPLS-DA
tmp_pls <- mixOmics::splsda(X, Y, ncomp = 2,
keepX = 25,
scale = FALSE)
# Prediction capacity
## Obtain the mean sd value calculated per sample in the dataset
mean_sd <- mean(apply(X, 1, sd))
## Randomly select 5 observations from the X matrix
set.seed(123)
s <- sample(1:nrow(X), 5)
Xnew <- X[s, , drop = FALSE]
## Modify the original samples: add Gaussian noise with sd = mean_sd/10
Xnew <- t(apply(Xnew, 1, function(x){x + rnorm(ncol(X), 0, mean_sd/10)}))
## Predict the class of the perturbed samples
myprediction <- predict(tmp_pls, Xnew, dist = "mahalanobis.dist")
## Save data set in a list with the correspondence name
RDS_files[[sub("_.*.?", "", i)]] <- myprediction$class
table <- myprediction$class %>%
kbl(caption = paste("Table", j, ": The predicted classes for the randomly selected and perturbed samples of",
sub("_.*.?", "", i), "are:")) %>%
kable_classic(full_width = F, html_font = "Cambria")
index <- length(tables1) + 1
tables1[[index]] <- table
# The 10 most important predictor variables selection to each group
## Those with positive weight in component 1 -> group_1
group_1 <- tmp_pls$loadings$X %>% as.data.frame() %>% arrange(desc(comp1)) %>%
head(10) %>% row.names() %>% {mets$Mets_names[match(., mets$ID)]}
## Those with negative weight in component 1 -> group_2
group_2 <- tmp_pls$loadings$X %>% as.data.frame() %>% arrange(comp1) %>%
head(10) %>% row.names() %>% {mets$Mets_names[match(., mets$ID)]}
## If group 1 loads negatively on component 1, the metabolites associated with group 1 by sPLS-DA carry negative weights; if it loads positively, they carry positive weights. The check below orders the group labels accordingly.
if (sign(tmp_pls$loadings$Y[1,1]) < 0){
group <- tmp_pls$loadings$Y %>% row.names()
} else {
group <- tmp_pls$loadings$Y %>% row.names() %>% rev()
}
VIP_metabolites <- cbind(group_2, group_1) %>% `colnames<-`(., group)
table <- VIP_metabolites %>%
kbl(caption = paste("Table", j+1,": The 10 most important predictor \nvariables of",
sub("_.*.?", "", i), "are: ")) %>%
kable_classic(full_width = F, html_font = "Cambria")
index <- length(tables2) + 1
tables2[[index]] <- table
# Save PLS variables loadings in RDS files
## Change ID by their metabolite name
global_pls_loadings <- tmp_pls$loadings$X %>% .[, 1] %>% as.data.frame()
new_row_names <- rownames(global_pls_loadings) %>% {mets$Mets_names[match(., mets$ID)]}
rownames(global_pls_loadings) <- new_row_names
## Split the data frame in a positive and negative one. Without 0 weight
positive_weight <- subset(global_pls_loadings, . > 0) %>% arrange(.) %>% `colnames<-`(., group[2])
negative_weight <- subset(global_pls_loadings, . < 0) %>% arrange(.) %>% `colnames<-`(., group[1])
## Save both data set in a list with the correspondence name
list_1 <- list()
list_1[["positive_weight"]] <- positive_weight; list_1[["negative_weight"]] <- negative_weight
RDS_files1[[sub("_.*.?", "", i)]] <- list_1
# GRAPHS
## Loadings
plotLoadings(tmp_pls, comp = 1, method = 'mean', contrib = 'max',
title = paste("Loadings of the 100 principal \nmetabolites of", sub("_.*.?", "", i)))
## Variables (Metabolites)
plotVar(tmp_pls, comp = c(1,2),
var.names = F, cex = 3, cutoff = 0.8,
title = paste("Correlation plot of the 100 \nprincipal metabolites of", sub("_.*.?", "", i)))
## HeatMap
## Note: execute this block directly in the console (an X11 device is opened for the heatmap)
X11()
cim(tmp_pls, comp = 1,
xlab = "Metabolites", ylab = "Samples", margins = c(7, 15),
title = paste("Heatmap of", sub("_.*.?", "", i),
"\nnormalized by", sub(".*/(.*?)\\-.*", "\\1", norm_selected), "method"), lwid = c(0.1, 0.9))
dev.off()
## Individual samples per component
my_colors <- c("steelblue3", "green3")
plotIndiv(tmp_pls, comp = 1:2,
group = Y, col = my_colors, style = "ggplot2",
title = paste("PC1 & PC2", i),
legend = TRUE, legend.position = "right",
legend.title = "Sampling condition", ellipse = TRUE,
ellipse.level = 0.95, centroid = FALSE)
}
#saveRDS(RDS_files, file = "../../../Supplementary_material/Supplementary_tables/Global_predictions_results.rds")
#saveRDS(RDS_files1, file = "../../../Tables_TFM/GLobal_loadings_pls_comparatives.rds")
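Since the perturbed samples come from known groups, the prediction step can be summarized with a confusion table. A minimal base-R sketch with toy, hypothetical labels (not taken from the real datasets):

```r
## Toy example: compare predicted classes against the true labels.
true_labels <- factor(c("OTA", "OTA", "noOTA", "noOTA", "OTA"))
pred_labels <- factor(c("OTA", "noOTA", "noOTA", "noOTA", "OTA"),
                      levels = levels(true_labels))
conf <- table(Predicted = pred_labels, True = true_labels)
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # 0.8: four of the five toy samples are classified correctly
```

In the real workflow, the entries of `myprediction$class` would play the role of `pred_labels` and `Y[s]` the role of `true_labels`.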
| OTA | noOTA |
|---|---|
| Choline | 2-Methylserine |
| Cholinesulfate | Pestarhamnose B |
| 1,5-Pentanedithiol | 1-palmitoylglycerol;MAG(16:0) |
| (-)-alpha-Cedrene | Hexadeca-5,9-dienoic acid |
| (-)-Isopiperitenone | C6H16N3P3 |
| (-)-monascumic acid | Tolycaine |
| (-)-trans-Carveol | n-Amylbenzene |
| (+)-Bornane-2_5-dione | Guanosine |
| (+)-ethyl homononactate | (9Z)-Tetradecenoicacid |
| (+)-exo-5-Hydroxycamphor | Ethyl3-oxohexanoate |
Heatmaps from the PLS data. For more detail, run the code in the console and consult the resulting graph.
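To avoid the interactive `X11()`/`dev.off()` step, the heatmap can be written straight to a file by opening a `png()` device around the `cim()` call. A sketch, assuming `tmp_pls` from the loop above is still in memory; the output path is an example:

```r
out_file <- "cim_heatmap.png"  # hypothetical output path
if (exists("tmp_pls") && requireNamespace("mixOmics", quietly = TRUE)) {
  png(out_file, width = 1200, height = 900, res = 150)
  mixOmics::cim(tmp_pls, comp = 1,
                xlab = "Metabolites", ylab = "Samples", margins = c(7, 15))
  dev.off()
}
```

This makes the heatmap reproducible when the document is knitted non-interactively.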